Unsupervised Analysis of Days of Week

Treat each day's hourly crossing counts as a feature vector, then use unsupervised methods to learn how the days relate to one another.



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn; sn.set()
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from jupyterworkflow.data import get_freemont_data

Get Data



In [2]:

    
data = get_freemont_data()
data.head()









    Out[2]:






  
    
      
                     East  West  Total
Date
2012-10-03 00:00:00   4.0   9.0   13.0
2012-10-03 01:00:00   4.0   6.0   10.0
2012-10-03 02:00:00   1.0   1.0    2.0
2012-10-03 03:00:00   2.0   3.0    5.0
2012-10-03 04:00:00   6.0   1.0    7.0



In [3]:

    
data.resample('W').sum().plot()









    Out[3]:





<matplotlib.axes._subplots.AxesSubplot at 0xc7b9780>



In [4]:

    
ax = data.resample('D').sum().rolling(365).sum().plot();
ax.set_ylim(0, None);



In [5]:

    
data.groupby(data.index.time).mean().plot();



In [6]:

    
pivoted = data.pivot_table('Total', index=data.index.time,
                          columns=data.index.date)
pivoted.iloc[:5, :5]



In [7]:

    
pivoted.plot(legend=False, alpha=0.01);



In [8]:

    
# Transpose so that each row is one day and each column is one hour of the day
X = pivoted.fillna(0).T.values
X.shape









    Out[8]:





(1610, 24)
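The pivot-and-transpose step above can be checked on a small example. The sketch below uses a synthetic three-day hourly series (the names `toy`, `pivoted_toy`, and `X_toy` are hypothetical stand-ins, not part of the notebook) and should give one row per day and one column per hour, just as `X` has 1610 rows and 24 columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the counter data: 3 full days of hourly totals.
# (Hypothetical values; the real data comes from get_freemont_data().)
index = pd.date_range("2012-10-03", periods=3 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Total": np.arange(len(index), dtype=float)}, index=index)

# Rows become the time of day, columns become the calendar date.
pivoted_toy = toy.pivot_table(values="Total",
                              index=toy.index.time,
                              columns=toy.index.date)

# Transpose so each row is a day and each column is an hour.
X_toy = pivoted_toy.fillna(0).T.values

print(pivoted_toy.shape)  # (hours, days)
print(X_toy.shape)        # (days, hours)
```

Filling missing hours with 0 treats a gap in the record as "no crossings", which is a simplification worth keeping in mind when reading the clusters.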

Principal Component Analysis (PCA)



In [9]:

    
X2 = PCA(2, svd_solver='full').fit_transform(X)
X2.shape









    Out[9]:





(1610, 2)
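To see what the two-component projection does, here is a minimal sketch of PCA written directly with NumPy's SVD on synthetic data. It is an illustration under stated assumptions, not scikit-learn's implementation: the array `X_toy` is a hypothetical stand-in for `X`, and the result should match `PCA(2).fit_transform` only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for X: 200 "days" x 24 "hours", generated from
# two latent patterns plus a little noise.
latent = rng.normal(size=(200, 2))
patterns = rng.normal(size=(2, 24))
X_toy = latent @ patterns + 0.01 * rng.normal(size=(200, 24))

# PCA = center the columns, then take the SVD of the centered matrix.
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal directions (rows of Vt).
X2_toy = X_centered @ Vt[:2].T

# Fraction of total variance captured by each component.
explained = S**2 / np.sum(S**2)

print(X2_toy.shape)
print(explained[:2].sum())
```

If the first two entries of `explained` sum to well below 1 on the real data, the 2-D scatter plots hide some of the structure that the 24-dimensional clustering sees.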

Unsupervised Clustering



In [10]:

    
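from sklearn.mixture import GaussianMixture
# Fit a two-component mixture on the full 24-hour profiles (X), not on the 2-D projection.
# Which cluster is numbered 0 or 1 is arbitrary and may differ between runs.
gmm = GaussianMixture(2)
gmm.fit(X)
labels = gmm.predict(X)
labels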









    Out[10]:





array([0, 0, 0, ..., 1, 0, 0], dtype=int64)



In [11]:

    
# Plot both PCA components (X2), colored by cluster label
plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap='rainbow')
plt.colorbar()









    Out[11]:





<matplotlib.colorbar.Colorbar at 0xe624978>



In [12]:

    
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
pivoted.T[labels == 0].T.plot(legend=False, alpha=0.1, ax=ax[0]);
pivoted.T[labels == 1].T.plot(legend=False, alpha=0.1, ax=ax[1]);

ax[0].set_title('Purple Cluster');
ax[1].set_title('Red Cluster');

Comparing with Day of Week



In [13]:

    
dayofweek = pd.DatetimeIndex(pivoted.columns).dayofweek



In [14]:

    
plt.scatter(X2[:, 0], X2[:, 1], c=dayofweek, cmap='rainbow')
plt.colorbar()









    Out[14]:





<matplotlib.colorbar.Colorbar at 0xf8ce7f0>
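Besides coloring the scatter plot, the comparison can be made numerically with a contingency table of cluster label against day of week. The sketch below uses two weeks of made-up labels (`labels_toy` and `dayofweek_toy` are hypothetical; in the notebook the same call would take `labels` and `dayofweek`).

```python
import numpy as np
import pandas as pd

# Two weeks of dates starting on a Monday (2012-10-01).
dates_toy = pd.date_range("2012-10-01", periods=14, freq="D")
dayofweek_toy = np.asarray(dates_toy.dayofweek)  # 0 = Monday ... 6 = Sunday

# Hypothetical cluster labels: 1 for Saturday/Sunday, 0 otherwise,
# plus one Thursday (2012-10-04) flagged as weekend-like.
labels_toy = (dayofweek_toy >= 5).astype(int)
labels_toy[3] = 1

# Rows: cluster label; columns: day of week; cells: number of days.
table = pd.crosstab(labels_toy, dayofweek_toy,
                    rownames=["cluster"], colnames=["dayofweek"])
print(table)
```

Any nonzero count in the weekend-like row under columns 0 to 4 corresponds to a weekday worth inspecting.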

Analyzing Outliers

The following dates are weekdays (`dayofweek < 5`) that were nonetheless assigned to cluster 1, i.e. weekdays with a holiday-like pattern.



In [15]:

    
dates = pd.DatetimeIndex(pivoted.columns)
dates[(labels == 1) & (dayofweek < 5)]









    Out[15]:





DatetimeIndex(['2012-11-22', '2012-11-23', '2012-12-24', '2012-12-25',
               '2013-01-01', '2013-05-27', '2013-07-04', '2013-07-05',
               '2013-09-02', '2013-11-28', '2013-11-29', '2013-12-20',
               '2013-12-24', '2013-12-25', '2014-01-01', '2014-04-23',
               '2014-05-26', '2014-07-04', '2014-09-01', '2014-11-27',
               '2014-11-28', '2014-12-24', '2014-12-25', '2014-12-26',
               '2015-01-01', '2015-05-25', '2015-07-03', '2015-09-07',
               '2015-11-26', '2015-11-27', '2015-12-24', '2015-12-25',
               '2016-01-01', '2016-05-30', '2016-07-04', '2016-09-05',
               '2016-11-24', '2016-11-25', '2016-12-26', '2017-01-02',
               '2017-02-06'],
              dtype='datetime64[ns]', freq=None)
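One way to check how many of these dates are actual holidays is to compare them against a holiday calendar. The sketch below is an assumption about how that check could be done, using pandas' built-in `USFederalHolidayCalendar`; the `flagged` index holds four dates copied from the output above, and the variable names are illustrative.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# A few of the flagged weekdays, copied from the DatetimeIndex above.
flagged = pd.DatetimeIndex(["2012-11-22", "2012-11-23",
                            "2012-12-25", "2017-02-06"])

# US federal holidays over the years covered by the data.
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start="2012-01-01", end="2017-12-31")

# Which flagged dates are federal holidays, and which are not?
is_holiday = flagged.isin(holidays)
unexplained = flagged[~is_holiday]

print(list(is_holiday))
print(unexplained)
```

Dates left in `unexplained` are not necessarily errors: days adjacent to a holiday, or days with unusual local conditions, can also produce a holiday-like pattern, but a federal calendar will not list them.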


Output of In [6] (pivoted.iloc[:5, :5]):

          2012-10-03  2012-10-04  2012-10-05  2012-10-06  2012-10-07
00:00:00        13.0        18.0        11.0        15.0        11.0
01:00:00        10.0         3.0         8.0        15.0        17.0
02:00:00         2.0         9.0         7.0         9.0         3.0
03:00:00         5.0         3.0         4.0         3.0         6.0
04:00:00         7.0         8.0         9.0         5.0         3.0